Skip to content

Validate checkpoint GPU UUID inputs early#2086

Open
aryanputta wants to merge 6 commits into
NVIDIA:mainfrom
aryanputta:fix-checkpoint-gpu-uuid-validation
Open

Validate checkpoint GPU UUID inputs early#2086
aryanputta wants to merge 6 commits into
NVIDIA:mainfrom
aryanputta:fix-checkpoint-gpu-uuid-validation

Conversation

@aryanputta

Copy link
Copy Markdown
Contributor

Summary

  • reject non-UUID gpu_mapping values in cuda.core.checkpoint before they reach driver-facing restore args
  • normalize malformed UUID string failures to the documented ValueError path
  • add source-level checkpoint helper tests that run without importing the built cuda.core package

Testing

  • python3 -m pytest cuda_core/tests/test_checkpoint_helpers.py --noconftest

Signed-off-by: Aryan <aryansputta@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label May 14, 2026
Comment thread cuda_core/tests/test_checkpoint_helpers.py Outdated
Comment thread cuda_core/cuda/core/checkpoint.py
@aryanputta

aryanputta commented May 16, 2026

Copy link
Copy Markdown
Contributor Author

@leofang You were right on both points here.
What I changed in commit 083e411:

  • removed the mock-only cuda_core/tests/test_checkpoint_helpers.py
  • narrowed _as_cuuuid() so restore mappings now accept UUID strings only
  • updated the docs to say gpu_mapping should use the strings returned by Device.uuid

On your question specifically... I dont' have a strong public facing use case for accepting CUuuid objects here. The intended caller-facing path is the string UUID returned by Device.uuid, and the existing real checkpoint tests already follow that contract. For example, cuda_core/tests/test_checkpoint.py builds migration mappings from devices[i].uuid in _build_rotation_mapping() and runs the driver-backed checkpoint/restore scenarios without mocks.

Comment thread cuda_core/cuda/core/checkpoint.py Outdated
@leofang leofang added enhancement Any code-related improvements P1 Medium priority - Should do labels Jun 8, 2026
@aryanputta

Copy link
Copy Markdown
Contributor Author

@leofang I pushed the requested CUuuid compatibility restore in 3261c10. The metadata check is still failing because the PR has no milestone, and GitHub does not let me set one on the upstream PR. Could you set the milestone to cuda.core v1.1.0 when you get a chance?

@leofang leofang added this to the cuda.core v1.1.0 milestone Jun 8, 2026
@leofang

leofang commented Jun 8, 2026

Copy link
Copy Markdown
Member

/ok to test 3261c10

@github-actions

github-actions Bot commented Jun 8, 2026

Copy link
Copy Markdown

Signed-off-by: Aryan Putta <aryansputta@gmail.com>
@aryanputta

Copy link
Copy Markdown
Contributor Author

@leofang I repushed this as 1520ee3 to retrigger CI because I do not have permission to rerun failed jobs on the upstream repo. I inspected the previous run before repushing: the failures looked unrelated to the checkpoint change (aarch64 sccache/GitHub cache 503, WSL NVML process-name UnicodeDecodeError, Windows jit_lto_fractal driver/backend mismatch, and one IPC flake). Could you approve the new SHA with /ok to test when convenient?

@aryanputta

aryanputta commented Jun 8, 2026

Copy link
Copy Markdown
Contributor Author

@leofang quick update: I moved this branch to current main as 41507c5 after confirming the previous WSL get_process_name failures were covered by #2118. The effective diff against current main is still only cuda_core/cuda/core/checkpoint.py. Could you approve the new SHA with /ok to test when convenient?

@leofang

leofang commented Jun 8, 2026

Copy link
Copy Markdown
Member

@aryanputta plz be patient and wait for one of us to check the CI. We can trigger re-runs of the failing jobs only, but if you merge with main we have to start from beginning and run all jobs, not just the failing one.

@leofang

leofang commented Jun 8, 2026

Copy link
Copy Markdown
Member

Also our CI is known flakey for a few things and rerun usually solves the problem.

@aryanputta

Copy link
Copy Markdown
Contributor Author

@aryanputta plz be patient and wait for one of us to check the CI. We can trigger re-runs of the failing jobs only, but if you merge with main we have to start from beginning and run all jobs, not just the failing one.

Sorry about that, @leofang! I was told in the past not to force-push/rebase on shared branches to avoid disrupting reviewers, so I thought moving it to main would be the cleaner approach. I will definitely keep a note for myself to just leave it be and wait for a maintainer to trigger the specific job reruns next time.

Also, just wanted to let you know that I fixed the other PR #2074 we were working on whenever you have some time to check it out.

Thank you!

@leofang

leofang commented Jun 9, 2026

Copy link
Copy Markdown
Member

/ok to test 1ff00e5

@leofang

leofang commented Jun 9, 2026

Copy link
Copy Markdown
Member

I was told in the past not to force-push/rebase on shared branches to avoid disrupting reviewers

That is correct, for the very same reasons: When the CI pipelines are triggered, it's based on a particular snapshot (commit). Any action that results in a new commit (ex: merge with main, rebase, force-push, ...) would forfeit the opportunity of re-running.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cuda.core Everything related to the cuda.core module enhancement Any code-related improvements P1 Medium priority - Should do

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants